Cleansing Databases of Misspelled Proper Nouns
Authors
Abstract
This paper presents a data cleansing technique for string databases. We propose and evaluate an algorithm that identifies a group of strings consisting of (multiple) occurrences of a correctly spelled string plus nearby misspelled strings. All strings in a group are replaced by the most frequent string of that group. Our method targets proper noun databases, including names and addresses, which are not handled by dictionaries. At the technical level, we give an efficient solution for computing the center of a group of strings and determining the border of the group. We use inverse strings together with sampling to efficiently identify and cleanse a database. The experimental evaluation shows that, for proper nouns, the center calculation and border detection algorithms are robust and that even very small sample sizes yield good results.
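The grouping idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes plain Levenshtein distance as the notion of "nearby", uses a fixed, hypothetical `radius` parameter in place of the paper's border detection, and omits inverse strings and sampling entirely. The function names `edit_distance` and `cleanse` are likewise illustrative.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cleanse(strings, radius=1):
    """Replace each string by the most frequent string of its group.

    Assumption (not from the paper): a group is every not-yet-assigned
    string within `radius` edits of its center, and centers are tried
    from most to least frequent.
    """
    counts = Counter(strings)
    mapping = {}
    for center, _ in counts.most_common():
        if center in mapping:
            continue  # already absorbed into a more frequent group
        mapping[center] = center
        for s in counts:
            if s not in mapping and edit_distance(center, s) <= radius:
                mapping[s] = center
    return [mapping[s] for s in strings]


names = ["Vienna", "Vienna", "Vienna", "Viena", "Vienne",
         "Bolzano", "Bolzano", "Bolzamo"]
cleaned = cleanse(names)
```

Here the rare variants "Viena" and "Vienne" fall within one edit of the frequent "Vienna" and are replaced by it, and "Bolzamo" is replaced by "Bolzano". A fixed radius is the crudest possible border; the abstract indicates the paper determines the border of each group rather than assuming one.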
Similar resources
Translation Quality Assessment of English Equivalents of Persian Proper Nouns: A case of bilingual tourist signposts in Isfahan
Abstract This study evaluated the translation quality of English equivalents of Persian proper nouns in the tourist signs and bilingual boards in Isfahan. To find different errors in the translations of the bilingual boards and tourist signs, the data were collected directly by taking picture or writing exactly from the available tourist signs and bilingual boards. Then, the errors were assesse...
White Page Construction from Web Pages for Finding People on the Internet
This paper proposes a method to extract proper names and their associated information from web pages for Internet/Intranet users automatically. The information extracted from World Wide Web documents includes proper nouns, E-mail addresses and home page URLs. Natural language processing techniques are employed to identify and classify proper nouns, which are usually unknown words. The informati...
Application of Proper Nouns as Terms of Address in Russian Compared to their Persian Equivalents
This study delved into the application of proper nouns as terms of address in Russian and Persian. In other words, it examined the rules governing the application of terms of address expressed as the names of individuals in different speech situations in both languages. The comparative study of the cultural features of languages spoken by Russians and Iranians called for the investigation of th...
A Novel Method to Evaluate Romanization Systems: The Case of Romanizing Arabic Proper Nouns
The transliteration of Arabic proper nouns to other languages is usually based on the phonetic translation of these nouns into their phonetic Latin counterparts. Most of the dictionaries do not include most of these nouns, although some may have meanings. Transliteration is essential generally to Natural Language Processing (NLP) field and specifically to machine translation systems, cross-lang...
Journal title:
Volume, issue:
Pages: -
Publication date: 2006